### Dataset Summary

This dataset contains 4803 question/answer pairs extracted from the
[BioStars](https://www.biostars.org/) website. The site focuses on
bioinformatics, computational genomics, and biological data analysis.

# Dataset Location and Details

https://huggingface.co/datasets/cannin/biostars_qa

## Source Data as a single JSON file

This dataset was generated by downloading individual posts, and only limited
metadata from each post is included. The full content of the downloaded posts
is available as a single JSON file in the following Zenodo record:

https://zenodo.org/record/7813785

# Code Details

- Executing the script will perform the entire process end-to-end.
- `get_biostars_dataset()`: This function downloads the content from the
  [Biostars API](https://www.biostars.org/info/api/); each post is saved as
  an individual JSON file.
- `extract_accepted_data()`: This function loads the individual files into
  Pandas, then extracts question/answer pairs. A pair was included if the
  answer was the accepted answer and the question had at least 1 vote. The
  content is then formatted as an Apache Parquet dataset with the columns
  INSTRUCTION, RESPONSE, SOURCE, and METADATA.
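
The filtering rule described for `extract_accepted_data()` can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the script's actual code: it uses plain dictionaries instead of Pandas, and the function name `extract_pairs`, the post fields (`id`, `type`, `parent_id`, `accepted`, `vote_count`, `content`), the `SOURCE` value, the `METADATA` layout, and the sample posts are all assumptions made up for the example. Only the four output column names come from the description above.

```python
def extract_pairs(posts, min_votes=1):
    """Pair each question with its accepted answer.

    Field names are hypothetical; the real Biostars API payload may differ.
    """
    # Index questions by id so each answer can look up its parent question.
    questions = {p["id"]: p for p in posts if p["type"] == "Question"}

    pairs = []
    for post in posts:
        # Keep only answers that are flagged as accepted.
        if post["type"] != "Answer" or not post.get("accepted"):
            continue
        question = questions.get(post["parent_id"])
        # Require a known parent question with at least `min_votes` votes.
        if question is None or question["vote_count"] < min_votes:
            continue
        pairs.append({
            "INSTRUCTION": question["content"],
            "RESPONSE": post["content"],
            "SOURCE": "biostars",  # assumed value
            "METADATA": {"question_id": question["id"], "answer_id": post["id"]},  # assumed layout
        })
    return pairs


# Made-up posts for illustration only.
sample_posts = [
    {"id": 1, "type": "Question", "vote_count": 3, "content": "How do I index a BAM file?"},
    {"id": 2, "type": "Answer", "parent_id": 1, "accepted": True, "vote_count": 5,
     "content": "Use samtools index."},
    {"id": 3, "type": "Question", "vote_count": 0, "content": "Question with no votes"},
    {"id": 4, "type": "Answer", "parent_id": 3, "accepted": True, "vote_count": 1,
     "content": "Accepted, but the question has no votes."},
    {"id": 5, "type": "Answer", "parent_id": 1, "accepted": False, "vote_count": 2,
     "content": "An answer that was not accepted."},
]

pairs = extract_pairs(sample_posts)
```

With these sample posts, only the pair formed by posts 1 and 2 passes both conditions. In a Pandas-based version, the resulting rows could then be written out with `DataFrame.to_parquet`.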
